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Abstract 

The pseudofermion action of the Hybrid Monte Carlo (HMC) algorithm for dynam- 
ical fermions is modified to directly incorporate Incomplete LU (ILU) factorisation. 
This reduces the stochastic noise and allows a larger molecular dynamics step-size 
OO ■ to be taken, cutting the computational cost. Numerical tests using the two-flavour 

Schwinger model are presented, where a two-step ILU preconditioning of the even-odd 
fermion matrix allows the step-size to be increased by a factor of two over the standard 
even-odd formulation. 

o 



1 Introduction 



Monte Carlo integration of the partition function of QCD with light quarks remains 
a computationally demanding task. At present, Hybrid Monte Carlo (HMC) Q| is 
amongst the most widely used algorithms for generating an ensemble of gauge field 
configurations with the dynamical QCD probability distribution. This exact algo- 
rithm combines molecular dynamics evolution in a fictitious simulation time, with a 
Metropolis test to ensure detailed balance. The effects of dynamical Wilson quark 
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fields are introduced using gaussian-distributed "pseudofermion" fields and most of 
the computational effort goes into inverting the fermion matrix at each step of the 
molecular-dynamics trajectory. As a result, attention has focussed on improving the 
algorithms for this large, sparse matrix inversion. 

ILU preconditioning schemes are commonly used to accelerate iterative inverters. 
For an interaction matrix, this translates into first ordering the sites on the lattice, 
then decomposing the matrix into upper and lower components. The upper segment 
couples a site only to those neighbours with a higher ordering index, and similarly the 
lower matrix couples to the lower-ordered sites. The preconditioning matrices are then 
constructed from these two terms. In one highly efficient inversion method for parallel 
machines, the SSOR scheme 0], a "locally-lexicographic" ILU preconditioning is used 
and the fermion matrix is subsequently inverted using BiCGStab. In Ref. the 
commonly used even-odd (or "red-black" ) scheme was recognised as an ILU decompo- 
sition, but not the optimal one. Even-odd (eo) preconditioning of the pseudofermion 
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coupling matrix has also been used |3|] to reduce the stochastic noise in molecular 
dynamics evolution, and leads to an increase in the acceptance rate of the Metropolis 
test. In this paper, it is first noted that, following the even-odd example, any ILU 
preconditioning can be applied directly to the matrix appearing in the pseudofermion 
action. Beyond this, a simple two-step scheme is presented, where the matrix is first 
even-odd preconditioned and the sites on one sub-lattice are subsequently ordered 
and ILU-factorised again. The global-lexicographic ordered version of this two-step 
scheme was proposed in Sec. 4 of Ref. [Q. In this work, other ordering schemes are 
tested. The two-step method leads to further improvements in the solver convergence 
rate and when used in the pseudofermion action, the HMC algorithm performance is 
also significantly enhanced. 

The paper is organised as follows; Sec. |^ implements ILU preconditioning in the 
pseudofermion action and presents the two-level scheme, then in Sec. [|, these algo- 
rithms are tested in simulations of the Schwinger model. Here, using the two-step 
ILU preconditioned even-odd (eo-ILU) fermion matrix, a performance improvement 
of a factor of about two is found compared to the standard even-odd preconditioning. 
Sec. |] briefly discusses application of the method in large-scale simulations on parallel 
computers and which use improved fermion actions. 



2 Preconditioning the pseudofermions in HMC. 

For simulations of a gauge theory with two degenerate flavours of dynamical fermions, 
the partition function is 

Z= fw detM 2 e" S9 , (1) 



with S g the lattice Yang-Mills discretisation and det M the determinant of the fermion 
matrix. In the HMC algorithm, this determinant is re-expressed as a gaussian integral 
over new bosonic degrees of freedom; the "pseudofermions" 

detAftM= (vct>*V4> g-^K^rY ( 2 ) 

Notice that to ensure the gaussian integral is well defined, the number of fermion 
flavours simulated must be even (Nf = 2 is assumed throughout this work) and the 
75-hermiticity property of the Wilson fermion matrix has been used; ie 

M t =7 5 M7 5 , hence detM = detM f . (3) 
Then the partition function to be simulated is written as 

Z = fvUV<j)*V<j) e -s s -0*KM]-y ^ 



The Hybrid Monte Carlo algorithm generates a new element in the sequence of configu- 
rations in two stages. First, a fictitious continuous time variable, t is introduced along 
with canonical momentum variables, p M (x) conjugate to each of the gauge degrees of 
freedom, l/ M (x). A hamiltonian describing dynamics in the new time coordinate is 
introduced, 

H=^p 2 + S g + 4>* [MtM]"V, (5) 

and the system is evolved in r using a reversible, area-preserving integration scheme 
such as leap-frog. Since the integration scheme is inexact, the hamiltonian is not 
conserved and a subsequent stochastic step is added to compensate. After some time 
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interval, the new configuration is proposed as an element of the ensemble and accepted 
or rejected according to the Metropolis test on the change in the hamiltonian; 

P acc = min(l,e- m ). (6) 

The high computational cost of these dynamical fermion simulations arises as calcu- 
lating the force term acting on the conjugate momenta at each leap-frog step requires 
two inversions of the fermion matrix, M. 

An Incomplete LU (ILU) factorisation of M was demonstrated as an efficient means 
of accelerating inversion || ^ and has been used successfully in large-scale production 
runs by eg. the SESAM collaboration The ILU factorisation preconditions the 
fermion matrix, M by left and right multiplication with two readily invertible matrices; 

M=(I-L)- 1 M{I-U)- 1 (7) 

where L and U are the lower and upper parts of the Wilson hopping term. Defining 
L and U first requires the sites on the lattice, x are assigned an integer index, s(x) 
then the site ordering is defined as 

y > x if s(y) > s(x), (8) 

with y < x defined similarly. The lower part of the Wilson matrix is then 

L =( K S,. ^(^(i - T^^y.x+A + ^(x - + 7f.)*y,x-A when y < x 
xy I otherwise 

and U is defined similarly for sites where y > x. The full Wilson matrix is then 
equivalent to 

M = I - L - U. (10) 

Matrix inversion is accelerated since the new matrix, M is better conditioned than 
M. The preconditioning matrices, (I — L) and (7 — U) are easily inverted by either 
forward or backward substitution respectively. The 75-hermiticity (c/. Eqn. |3|) of the 
preconditioned matrix is preserved as 

Mt = 75 M 75 S i nce (I - L)^ = l5 (I - U)j 5 . (11) 

Matrix- vector operations proceed efficiently via the "Eisenstat trick" ]7l 0|; using 
Eqns. (0) and (|l(i|), M is re- written as 

m = (i - uy 1 + l)- 1 {1 -{i- uy 1 ) , (12) 

and the matrix operation is reduced to a backward substitution followed by a forward 
one. This requires approximately the same number of floating point operations as the 
original Wilson matrix- vector product. 

While the introduction of the pseudofermions to model the fermion determinant 
makes the molecular dynamics of the hamiltonian tractable, it also introduces extra 
randomness into this evolution. Simulations demonstrate that the Metropolis accep- 
tance rate is higher with the better-conditioned even-odd matrix. For this study, 
first note that any ILU decomposition can be applied directly to the determinant of 
dynamical fermion simulations; 

det M = det(J - Ly 1 det(7 - L - U) det(7 - Uy 1 = detM, (13) 

since det(7 — L) = det(7 — U) = 1. From this identity, the fermion determinant can 
be simulated using pseudofermions coupled via the preconditioned matrix, M 

Z= fvUV<j)*V<j) e -4>*K*ry (14) 
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The HMC algorithm can be applied to this new pseudofermionic partition function and 
the better conditioning of the matrix should lead to an improvement in the acceptance 
rate. The extent of this improvement will depend on the site ordering used, while 
physical expectation values computed on the ensemble will not. This will be tested in 
Sec. | 

2.1 Molecular dynamics for the preconditioned action. 

The new hamiltonian generating the molecular dynamics is 

H = \v 2 + S g + cj>* [MWl] ~ l 0, (15) 

In order to perform the molecular-dynamics updates, the force term acting on the 
momenta of the hamiltonian of Eqn. ( |l5|) must be determined. To demonstrate that 
the force term for the ILU preconditioned matrix, M is readily implemented, first note 
that the force arising from the gauge action is trivially left unchanged and consider 
the term arising from the pseudofermionic action alone, 

S pf = 0*[MtM]~V (16) 
The derivative of this action with respect to r, keeping <fi fixed is 

-<j)* [MtM]' 



dS pi 



dr 



M + M^—— 



dr dr 



[M^M] 0. (17) 



Introducing the auxiliary fields, Y — M* 1 <fi and X = M 1 Y, this becomes 

d Ai = -Y^X-X^Y. (18) 
dT dr dr 

At each leap-frog step, the fields Y and X must be recomputed; this is the section of the 
update requiring most of the computational effort. Note that the inversion proceeds 
more rapidly than in the original HMC algorithm as M is better conditioned. The 
derivative of M can be expanded after two new auxiliary fields are introduced; they 
are 

Y L = (I - Lt)- 1 y and X v = (I - Uy 1 X. (19) 

Note that constructing these fields requires only two backward substitutions (L^ is 
an upper-diagonal matrix), rather than an iterative method and so is an insignificant 
overhead compared to re-evaluating X and Y. With these new fields, and using the 
definition of M in Eqn. (Q), the derivative becomes 

dS p{ ( dM ,* du Y <v* dL v ,^ r ,\ /0 n\ 

~dV = ~ \ L ~d^ Xv + ^ Xu + Yl ^ y+ h - c -) ■ (20) 

Now derivatives of the original Wilson matrix, M and its upper and lower sections 
appear in Eqn. (|2(]) and so calculating the relevant force terms proceeds straightfor- 
wardly from here. 



2.2 Even-odd (eo) and two-stage (eo-ILU) preconditioning 

In Ref. Ea], it was demonstrated that the commonly used even-odd preconditioning 
of the fermion matrix can be written as an ILU decomposition where the ordering 
function, s(x) is simply on even lattice sites and 1 on odd sites. There is no ordering 
ambiguity for the sites on the even or odd sub-lattices since M does not contain 
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hopping terms that directly couple sites on the same sub-lattice. For this example, 
the inversion of (7 — L) and (/ — U) can be written explicitly. 
The preconditioned matrix is 



M = 



\ / ho -K-Aoe \ ( loo kA c 

Iff I \ -KAfo Iff I \ Iee 



ho 

I ee - k 2 A b 



(21) 



Pseudofermion degrees of freedom on the odd sites are completely decoupled from the 
gauge fields and can be discarded. The pseudofermions on even sites are then coupled 
via the sub-matrix, M ee — I ee — K 2 A eo A oe . 

Since M ee is "ultra-local" (the only non-zero elements of the matrix couple sites 
within a small neighbourhood) and its elements can be written explicitly, it can be 
ILU preconditioned once again. The global lexicographic version of this second level 
of preconditioning was investigated in Section 4 of Ref. If a new ordering func- 
tion, s e (x e ) is defined for sites on the even sub-lattice, then the (two-step) eo-ILU 
preconditioned matrix is 

Mee^il-L^M^I-Uee)- 1 , (22) 

with 

Lee — K ~ ^ A eo A oe 

L.y. ■ ( 2 3) 

y c <x c 

As before, the matrix determinant is left unchanged by this second level of precondi- 
tioning; 

det M ee = det M ee = det M. (24) 
From here, a two-step preconditioned pseudofermion action can be implemented in the 



HMC algorithm, following a similar construction to Sec. 2.1. The new pseudofermion 
action is 

S e = ^* e [Mt e M ee ]~V, (25) 

and, after introducing the auxiliary fields (on even sites only) Y e = MJ~ l (f) e and 
X e = M^Ye the derivative with respect to r becomes 

^ = -Y:^x e -x : ^ky e . (26) 

dr dr dT 

Finally, after introducing the new fields Yr, e = (I—Ll e )~ 1 Y e and X\je = {I—U e e)~ 1 X e , 
the derivative is 

= " (rle^Xve + ^X Ue + Yi^Ye + h.C.) . (27) 

dT \ dr dr dT ) 

This expression contains derivatives of the original even-odd matrix and its lower and 
upper components. The first term is efficiently computed using the reconstruction 
trick; Yl and Xjj arc defined on odd sites as Xjj = K,A oe Xue and Yjj = KA^ oe Yu e 
then 

—X Ue =Y L — 

and so the force term is readily computed. Unfortunately no such reconstruction can 
be used for the terms involving U ee and L ee - The reconstruction trick relies on the 
observation that the preconditioning matrices for the even-odd decomposition can be 
computed explicitly (as in Eqn. ^) . While these algebraically cumbersome force terms 
must be evaluated explicitly, their evaluation is still a simple, local computational task. 



YL-T^Xue^YZ—Xu (28) 
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3 Testing the method: the Schwinger model 



To investigate the benefits of preconditioning, simulations of the two-flavour Schwinger 
model were performed. This model is a U(l) gauge theory in 1 + 1 dimensions and is 
a convenient testing ground for algorithms intended for full QCD simulations as it is 
asymptotically free, confining and has a spontaneously broken chiral symmetry. The 
link variables are phases; J7 M (x) = exp i# M (x) and for this study, the compact U(l) 
gauge action is used 

Sg = ^J^ 1 - C0S ( 29 ) 
□ 

with On the sum of angles around the plaquette. 
3.1 ILU orderings 

Ref. [^| noted that the performance of ILU preconditioned inversion depends on the 
choice of ordering of the lattice sites. To study this, the spectral properties of the 
matrix were determined for a number of different schemes. The smallest and largest 
eigenvalues (and hence the condition number) of the hermitian matrix, Q (= 75M) 
were computed on an ensemble of 100 dynamical (3 — 4.0, k = 0.26, 32 x 32 lattices. 
The orderings tested were the standard even-odd checkerboard, a locally lexicographic 
(11) scheme and a strip -lexicographic (si) scheme. For the (ll n ) ordering, the lattice is 
first decomposed into an even-odd checkerboard ofnxn blocks then the sites in each 
block are indexed starting at the corner with smallest coordinates and progressing 
first along the x-axis until the end of the block is reached before moving onto the next 
y value. Note that (Hi) is just the traditional even-odd indexing and (11 n), where N 
is the lattice extent, denotes a global lexicographic ordering. The (si) ordering first 
breaks the lattice into 1 x N strips and then the sites on each strip are indexed in 
order. Fig. [j] illustrates these ordering schemes on a 4 x 4 lattice. 




(a) (b) (c) 

Figure 1: The index function, s(x) for various lattice preconditionings. Grey 
shading denote areas that are lexicographically ordered, (a) is the standard 
even-odd preconditioning, denoted ll\ (b) is the locally lexicographic scheme 
with block size 2, U2 and (c) is the strip-lexicographic scheme, sl\. 



Table [I] presents the spectral properties of the hermitian fermion matrix, Q for a 
number of these schemes computed on 100 dynamical configurations. The eigenvalues 
of Q with the largest and smallest magnitudes are given, along with their ratio, C and 
n;t er , the number of BiCGStab iterations required to invert the fermion matrix. All the 
eigenvalues of Q were computed using the Lanczos algorithm with reorthogonalisation. 
Their reliability was checked by testing det Q = det Q configuration-by-configuration. 
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Ordering 


l^lmin 


1 ^ 1 max 


C 


'''iter 


None 


0.01768(48) 


2.02173(11) 


123.7(3.7) 


501(11) 


eo (Hi) 


0.03511(94) 


1.56235(76) 


48.1(1.4) 


202(5) 


lh 


0.0419(11) 


1.9873(14) 


51.2(1.5) 


104(3) 


lh 


0.0439(11) 


2.8837(29) 


70.7(2.2) 


66.3(3) 


lljy 


0.0444(11) 


4.706(15) 


114.0(3.5) 


49.1(3) 


sli 


0.0443(12) 


2.7254(23) 


66.4(2.0) 


104(1) 



Table 1: Spectral properties of M (and Q) at = 4.0, k = 0.26 on a 32 x 32 
lattice for various ILU schemes. The labelling of the ILU schemes is described 
in the text. Results for the schemes with optimal condition number and solver 
performance are in bold. 



This relation was seen to hold to machine precision. Note that to ensure consistently 
accurate solutions in column 5, \Mx — y\/\y \ < 10~ 12 was used as the stopping criterion 
for the BiCGStab solver (ie. the reconstructed residual of M rather than M). 

These results confirm that the condition number of the preconditioned matrix 
and the solver performance is strongly dependent on the site ordering scheme. An 
unusual pattern emerges however; the preconditioning that leads to optimal solver 
performance (measured by the number of iterations required for convergence) is the 
global lexicographic scheme, 11 n but the scheme with the lowest condition number is 
the standard even-odd decomposition. These two optimal orderings are highlighted 
in Table [l]. The 11 n inverter out-performs the even-odd method by a factor of four in 
iterations. 

The mis-match between the optimal ordering for better conditioning and for inver- 
sion may be explained by closer examination of the eigenvalue spectrum. Fig. ^ shows 
the density of (the absolute value of) the eigenvalues of Q without preconditioning 
and using even-odd and global lexicographic ILU schemes. For the (11 n) scheme, most 
of the eigenvalues are distributed close to unity while a small number lie in a long tail 
stretching out to |A| = 5. This tail, while responsible for the high condition number is 
sparsely populated and so easily handled by the BiCGStab algorithm. The even-odd 
matrix has a hard upper bound on its eigenvalues and so is better conditioned, but 
these eigenvalues are broadly distributed inside that band. Note also that the per- 
formance results in Table. [I] are for inversion of M, not Q. The spectrum of M has 
eigenvalues that are generally complex and has not been computed here; solver perfor- 
mance may depend on the eigenvalue distribution in the complex plane. A heuristic 
argument for the better performance of the global-lexicographic ordering can be made 
by noting that each iteration of the solver couples every site on the lattice with every 
other site. 



3.2 ILU HMC algorithm performance 



The ordering schemes of Sec. 3.1 were used to precondition the matrix coupling pseud- 
ofermions in a set of dynamical fermion simulations on lattices of size 32 x 32. The 
gauge coupling was fixed to (3 = 4.0 throughout. For the computations of the X and 
Y fields, the BiCGStab algorithm was again used (the stopping criterion was based on 
the residual of the preconditioned solution vector, to reduce computational overheads). 
In the molecular dynamics leap-frog step, the Sexton- Weingarten integrator scheme 
II was implemented. Here, the force on the conjugate momenta is separated into two 
parts; one arising from the pseudofcrmion action and the other from the Yang-Mills 
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Figure 2: The eigenvalue probability distribution of Q and Q. 



term of Eqn. (g9|). By interleaving the different momenta and gauge field updates, 
the step-size used for integrating the Yang-Mills force term, dr g is then made smaller 
than the step-size for the pseudofermionic force, drj. This isolates any changes in 
the update hamiltonian arising from the preconditioned fermion determinant. For all 
simulations, dr g = jdrj was used. This value was chosen in some short tests and 
subsequently found to be sufficient to remove the finite-step-size effects of the gauge 
action in all runs. The molecular-dynamics trajectory length was selected randomly 
in the interval | < r < 1^ to ensure ergodicity. 

Table || shows the expectation values of the f x f and 4x4 Wilson loops from 
simulations using the preconditioning orderings of Sec. |3.f| . All simulations were per- 
formed on a 32 x 32 lattice, at (3 — 4.0, n = 0.2600. Wilson loop averages agree within 
statistical uncertainties, as expected. Also in Table || are the acceptance probabilities 
of the final Metropolis test at the end of each molecular dynamics trajectory and (for 
reference) the condition numbers, C from Table, [j] are included. While all the 1LU 
preconditioned simulations outperform the standard HMC algorithm, there is little 
variation among the different orderings. The optimal schemes use even-odd and strip- 
lexicographic ordering. The global-lexicographic scheme, while the best ordering for 
inversion, is the worst for HMC performance. It is possible that the higher condition 
number of Q in this scheme, due to the long tail of eigenvalues seen in the spectrum 
of Fig. |^, leads to instabilities in the molecular-dynamics evolution at a smaller step- 
size. Notice there is a correlation between the highest acceptance rates and the lowest 
condition numbers. 

3.3 eo-ILU preconditioning 

Sec. |2.2| introduced a two-step preconditioning scheme (eo-ILU), where initially an 
even-odd decomposition of the fermion matrix is applied, followed by an ILU precon- 
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Ordering 


N 

1 v sweep 


(W(l,l)) 


(W(4,A)) 


'ace 


C 


None 


5000 


0.87348(41) 


0.1800(31) 


0.120(11) 


123.7(3.7) 


eo (Hi) 


10000 


0.87407(14) 


0.18542(82) 


0.7310(74) 


48.1(1.4) 


lh 


1000 


0.87420(43) 


0.1815(25) 


0.721(17) 


51.2(1.5) 


ll A 


1000 


0.87416(34) 


0.1834(27) 


0.691(16) 


70.7(2.2) 


ll N 


1000 


0.87413(48) 


0.1842(30) 


0.647(21) 


114.0(3.5) 


sli 


1000 


0.87360(37) 


0.1826(27) 


0.745(14) 


66.4(2.0) 



Table 2: Wilson loop expectation values and acceptance probabilities from 
simulations using different preconditioning site ordering schemes. All simula- 
tions were performed on 32 x 32 lattices at j3 = 4.0, K = 0.26 with an MD 
step-size of dr = jr. 



ditioning. In the second step, an ordering scheme for the sites on the even (or odd) 
sites only is defined. In this section, the performance of the BiCGStab solver and 
HMC algorithm on eo-ILU matrices is presented. Three ordering schemes are tested; 
the first is the standard global-lexicographic scheme and the second two are the two 
local preconditionings, labelled A and B, which are the extension of even-odd precon- 
ditioning on the sub-lattice. In order to avoid ambiguity, four "flavours" of sub-lattices 
must be defined, since the even-odd matrix couples the eight sites with off-set vectors 
{±1, ±1}, {±2, 0} and {0, ±2}. These three preconditionings are illustrated for a 4 x 4 
lattice in Fig. |[ 




(a) (b) (c) 

Figure 3: Two-step eo-ILU preconditioning schemes, (a) and (b) are the two 
distinct local decompositions, (in two dimensions). In (c) the entire even sub- 
lattice is lexicographically ordered. 



Table |3| presents the eigenvalues of the hermitian eo-ILU preconditioned matrix, 
Q ee with the largest and smallest absolute values, along with the condition number 
and average number of BiCGStab solver iterations required for matrix inversion. The 
same pattern emerges as with the one-level ILU scheme comparisons. The best order- 
ing scheme to accelerate the BiCGStab algorithm is the global-lexicographic scheme, 
while the scheme with the smallest condition number is one of the local four-flavour 
preconditionings. These optimal schemes are highlighted bold in Table [| The two- 
level preconditioning scheme reduces the condition number of the matrix and the 
number of iterations required for inverting the fermion matrix even further than the 
single ILU scheme. The optimal ordering for matrix inversion out-performs the un- 
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Ordering 


l^lmin 


1 ^ I max 


C 




None 
eo-only 


0.02306(77) 
0.03511(94) 


2.02145(11) 
1.56235(76) 


99.5(3.9) 
48.1(1.4) 


501(11) 
202(5) 


global 
local (A) 
local (B) 


0.0838(21) 
0.0830(23) 
0.0900(24) 


5.906(43) 
1.85347(81) 
1.5719(10) 


76.1(2.5) 
24.5(1.0) 
18.82(55) 


31.2(1) 

68.0(5) 
76.9(7) 



Table 3: Solver and HMC algorithm performance at j3 = 4.0, n = 0.26 on a 
32 x 32 lattice for various eo-ILU schemes. 



preconditioned matrix by a factor of 16, while the optimal scheme for improving the 
condition number reduces this number by a factor of five. 

3.4 eo-ILU HMC performance 

The eo-ILU preconditioned pseudofermion action was tested in a set of HMC algo- 
rithm simulations and compared to the unpreconditioned and even-odd schemes for 



two fermion masses. The studies of ILU HMC in Sec. |3.2| found the benefits of pre- 
conditioning were rather weakly dependent on the site ordering, but that the schemes 
leading to the highest HMC acceptance rate were the local orderings, such as the 
familiar even-odd method. The lowest condition number was also found to be an in- 
dicator of the best acceptance rate. Based on this and the results in Table ||, local 
eo-ILU ordering B was used. One simulation using the A ordering was also considered, 
and found to give a slightly poorer performance than scheme B. These simulations 
were performed on 64 x 64 lattices at j3 = 4.0 and two k values, 0.2570 and 0.2605. 
These parameters corresponded to pseudoscalar meson masses of amp = 0.210(3) and 
amp = 0.124(5) respectively. 

The dependence of the acceptance probability on step-size for the two fermion 
masses are presented in Figs. ^ and || In these figures, fits to the expected small-step- 
size behaviour of the acceptance probability ^] are included, 

dr^ 2 



<7>acc) = crfc 1-1 , (30) 

where tq is a fit parameter and determines the characteristic scale of the equations 
of motion. t presents a reliable estimate of the algorithm efficiencies, assuming the 
autocorrelations along the Markov chains for the different preconditionings are the 
same at a fixed Metropolis acceptance. While these autocorrelations have not been 
studied in detail, this criterion does appear to hold approximately. Eqn. (^0|) models 
the expected acceptance rate only in the limit dr — > 0. The fit gives an unacceptable 
X 2 once the acceptance rate falls below approximately 80%. At the lighter fermion 
mass (k = 0.2605) the acceptance rate tended to break down rather suddenly as dr 
was increased. The UKQCD collaboration (To) studied this phenomena in detail, and 
concludes that the leap-frog integrator becomes unstable at a critical value of dr, which 
decreases as the light quark mass is reduced. To determine tq from a fit, the number of 
data points included was increased (starting with the smallest dr) until xV-^df > 1-5. 
These fits are summarised in Table |l|, along with the masses of the lightest mesons. 
Table ^ also includes the ratio of the characteristic molecular-dynamics time-scale of 
unpreconditioned HMC and the even-odd and eo-ILU schemes. For both values of the 
hopping parameter, the eo-ILU algorithm characteristic time is larger than standard 
HMC by a factor of about three and the even-odd method by about two. No significant 
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5> 0.4 
0.3 
0.2 
0.1 





A No preconditioning 
o Even-odd 
eo-ILU 



0.02 0.04 0.06 

dx 



0.08 



0.1 



Figure 4: Acceptance rate vs. molecular dynamics step-size for three 
preconditioning schemes. Dashed lines are fits to Eqn. (|30|). All simula- 
tions are performed on 64 x 64 lattices at (3 = 4.0, k = 0.2570 



mass dependence on this improvement is seen, although only two fermion masses were 
simulated. 
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Table 4: Fits to Eqn. ( |30| ) at two k values for three HMC algorithms with no 
preconditioning, one step (even-odd) and two step (eo-ILU) preconditioning, 
rifit indicates the number of points (starting from the lowest dr data) that could 
be included in the fit before an unacceptably high % 2 /iVdf was found. 
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Figure 5: Acceptance rate vs. molecular dynamics step-size for three 
preconditioning schemes. Dashed lines are fits to Eqn. (|30|). All simula- 
tions are performed on 64 x 64 lattices at f3 = 4.0, k = 0.2605 



4 Discussion: further implementations 

The preconditioned pseudofermion method, tested in the Schwinger model, can be 
applied directly in simulations of 4d gauge theories, such as QCD. As present, these 
computationally demanding calculations are being performed on large parallel com- 
puters. In Sec. ||, some correlation was seen between the performance of the precondi- 
tioned HMC algorithm and the condition number of the hermitian fermion matrix, Q. 
Finding the optimal site ordering is then reduced to minimising this condition number 
and a direct comparison of HMC performance from costly simulations can be avoided. 
Only a small set of orderings was tested in the two-dimensional study; the benefits of 
a given ordering will be dependent on the precise properties of the matrix in question 
and will differ in two and four dimensions. 

These simulations also demonstrated that a significant enhancement in the perfor- 
mance of the HMC algorithm was found by using a two-step (eo-ILU) preconditioning 
of the pseudofermions. A highly efficient ordering for increasing the acceptance of the 
Metropolis test is a local one; ie. matrix operations still involve only a small neigh- 
bourhood (out to two hops) around each site and do not require global lattice sweeps 
for their operation. This means these preconditionings can be applied to simulations 
on parallel computers, as many processors can be working on independent portions of 
the local eo-ILU matrix- vector operation. The four-dimensional lattice would required 
sixteen indices to cover the orderings of the even-odd matrix and it is unclear whether 
this would lead to prohibitively expensive or complex communications on parallel com- 
puters. A direct test seems the only way to assess this. Sec. || determined that the 
acceptance rate is not critically dependent on the ordering and a better scheme for 
efficient parallelism could be found. 



12 



Rcf. demonstrated that the Sheikholcslami-Wohlert (SW) action can be simu- 
lated using even-odd preconditioned HMC. As with the Wilson matrix, the even-odd 
SW matrix can be ILU preconditioned again, leading to an eo-ILU scheme with a 
possible faster inverter performance and a larger useful molecular-dynamics step-size. 
Many large-scale dynamical quark simulations (eg. CP-PACS, UKQCD; see Ref. jl^] 
for a review) use the SW fermion formulation. Lexicographic preconditioning has also 
been applied to more complex fermion matrices and can be extended to highly 
improved actions with interactions beyond nearest neighbours, such as the Symanzik- 
improved D234 action Q and fixed point actions (l^, [r| . This work suggests HMC 
simulations of these fermion actions can be accelerated, even though the standard 
even-odd decomposition does not decouple even and odd sub-lattices. Note also that 
other dynamical fermion methods, such as the "Kentucky algorithm" [[17J may also be 
enhanced by similar ILU or eo-ILU preconditioning. 

As a final remark here, it is worth examining the hopping parameter expansion of 
the inverse preconditioned matrices. The original Wilson matrix inverse has a term at 
0(k) while after any ILU preconditioning, the leading term appears at 0(k 2 ). With 
the two-step eo-ILU preconditioning, the lowest order term is now 0(k ). 

5 Conclusions 

In this paper, ILU matrix preconditioning has been applied directly to the pseud- 
ofermion action of the hybrid Monte Carlo algorithm. The optimal ordering scheme for 
a single level of preconditioning in the pseudofermion action are local, like the simple 
even-odd checkerboard, while for inverting the fermion matrix a global-lexicographic 
ordering is best. The performance of the HMC algorithm was seen to depend weakly 
on the ordering, but to correlate with the condition number of the fermion matrix, 
while inversion performance has a more complex behaviour and is influenced strongly 
by the site indexing. 

A two-level scheme was then introduced, in which the even-odd fermion matrix 
on a single sub-lattice was subsequently ILU preconditioned (eo-ILU). The global 
lexicographic ordering for this scheme was proposed in Section 4 of Ref. Q . In direct 
analogy, the optimal ordering on the sub-lattice for inversion was found to be a global 
lexicographic scheme, while the HMC algorithm performed best with a more local, 
"four-flavour" decomposition. This optimal two-level scheme was found to improve the 
performance of the HMC algorithm by a factor of two over even-odd pseudofcrmions 
and a factor of three relative to the unmodified HMC method. 
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